Dataist Dogma

Reflections and projects in Data Science, Machine Learning and AI "A critical examination of the Dataist dogma is likely to be not only the greatest scientific challenge of the twenty-first century, but also the most urgent political and economic project" - Yuval Noah Harari - Homo Deus: a Brief History of the Future (2016)

Recommender system: Australian suburbs using clustering and the Foursquare API


This project utilizes publicly available data to recommend 10 similar suburbs in Australia to a given suburb. This could be useful for people trying to decide where to work, rent or buy property. For example, it could be used on property search websites; a person searches for property in a given suburb and is provided a list of similar suburbs to consider buying in. There are three primary data points used to determine how similar other suburbs are to the given suburb;

  1. Foursquare Places API - current data on the venues (cafes, restaurants, parks, amenities etc) most often found in the suburb
  2. Australian Bureau of Statistics API – data on the cultural background (i.e. ancestry) of the people in the suburb
  3. RealestateVIEW.com.au – Age and income demographics of the suburb

For example, given the suburb ‘Brunswick West’ (my home suburb in Melbourne) the engine provides 10 suburbs which have similar venues (typically cafes, pubs and Middle Eastern restaurants), similar age and income residents (older middle class), and similar cultural heritage (English, Italian, Greek).

It utilizes K-Means clustering from Scikit-learn to cluster venue and cultural data, the pandasdmx package to interface with the Australian Bureau of Statistics API, and the Beautiful Soup package to scrape demographic information from the web.

In [9]:
#pandasdmx is used when calling the Australian Bureau of Statistics API
!pip install pandasdmx
Collecting pandasdmx
  Downloading https://files.pythonhosted.org/packages/e7/5d/7e29616b0376d0a15eb3013c909c4e9021b6c6e244ef1b48817b4d0c4d46/pandaSDMX-0.9-py2.py3-none-any.whl (45kB)
     |████████████████████████████████| 51kB 13.8MB/s eta 0:00:01
Requirement already satisfied: lxml in /opt/conda/envs/Python36/lib/python3.6/site-packages (from pandasdmx) (4.3.1)
Requirement already satisfied: pandas in /opt/conda/envs/Python36/lib/python3.6/site-packages (from pandasdmx) (0.24.1)
Collecting jsonpath-rw (from pandasdmx)
  Downloading https://files.pythonhosted.org/packages/71/7c/45001b1f19af8c4478489fbae4fc657b21c4c669d7a5a036a86882581d85/jsonpath-rw-1.4.0.tar.gz
Requirement already satisfied: setuptools in /opt/conda/envs/Python36/lib/python3.6/site-packages (from pandasdmx) (40.8.0)
Requirement already satisfied: requests in /opt/conda/envs/Python36/lib/python3.6/site-packages (from pandasdmx) (2.21.0)
Requirement already satisfied: python-dateutil>=2.5.0 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from pandas->pandasdmx) (2.7.5)
Requirement already satisfied: numpy>=1.12.0 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from pandas->pandasdmx) (1.15.4)
Requirement already satisfied: pytz>=2011k in /opt/conda/envs/Python36/lib/python3.6/site-packages (from pandas->pandasdmx) (2018.9)
Requirement already satisfied: ply in /opt/conda/envs/Python36/lib/python3.6/site-packages (from jsonpath-rw->pandasdmx) (3.11)
Requirement already satisfied: decorator in /opt/conda/envs/Python36/lib/python3.6/site-packages (from jsonpath-rw->pandasdmx) (4.3.2)
Requirement already satisfied: six in /opt/conda/envs/Python36/lib/python3.6/site-packages (from jsonpath-rw->pandasdmx) (1.12.0)
Requirement already satisfied: idna<2.9,>=2.5 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from requests->pandasdmx) (2.8)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from requests->pandasdmx) (2019.6.16)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from requests->pandasdmx) (3.0.4)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from requests->pandasdmx) (1.24.1)
Building wheels for collected packages: jsonpath-rw
  Building wheel for jsonpath-rw (setup.py) ... done
  Stored in directory: /home/dsxuser/.cache/pip/wheels/5c/00/9a/82822db383c2d96dcebf839786665a185f92d37e5026f9806f
Successfully built jsonpath-rw
Installing collected packages: jsonpath-rw, pandasdmx
Successfully installed jsonpath-rw-1.4.0 pandasdmx-0.9
In [10]:
import pandas as pd
import numpy as np
import json # library to handle JSON files
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe
import requests
from sklearn.cluster import KMeans
from pandasdmx import Request
from bs4 import BeautifulSoup

Get suburb and postcode data

List of Australian postcodes and suburbs are available online on many websites. I've replicated the file from http://www.corra.com.au/australian-postcode-location-data/ on Github

In [11]:
filename = "https://raw.githubusercontent.com/DataistDogma/Coursera_Capstone/master/Australian_Post_Codes_Lat_Lon/Australian_Post_Codes_Lat_Lon.csv"
postcodes = pd.read_csv(filename)
print("Postcodes shape: ", postcodes.shape)
postcodes.head()
Postcodes shape:  (16080, 7)
Out[11]:
postcode suburb state dc type lat lon
0 200 AUSTRALIAN NATIONAL UNIVERSITY ACT AUSTRALIAN NATIONAL UNI LPO Post Office Boxes -35.277272 149.117136
1 221 BARTON ACT NaN LVR -35.201372 149.095065
2 800 DARWIN NT DARWIN DELIVERY CENTRE Delivery Area -12.801028 130.955789
3 801 DARWIN NT DARWIN DELIVERY CENTRE Post Office Boxes -12.801028 130.955789
4 804 PARAP NT PARAP Post Office Boxes -12.432181 130.843310

Each row represents a Distribution Centre (dc) rather than a suburb or postcode. We're looking for a list of unique suburbs, so we'll filter the dataframe for unique suburbs and remove the 'dc' and 'type' column, which we don't need

In [12]:
postcodes.drop_duplicates(subset='suburb', inplace=True)
postcodes.drop(['dc', 'type'],axis=1,inplace=True )
postcodes.dropna(inplace=True) #drop the trailing row which has no data
postcodes.reset_index(inplace=True, drop=True) #reinex after having removed some rows above
postcodes.shape
Out[12]:
(14269, 5)

For simplicity we'll also create a smaller subset of suburbs, being those in Metropolitan Melbourne, the second-largest city in Australia. The postcode range for Melbourne are postcode between 3000-3207 & 8000-8499

In [13]:
melb_postcodes = postcodes[(postcodes['postcode']>=3000)&(postcodes['postcode']<=3207)|(postcodes['postcode']>=8000)&(postcodes['postcode']<=8499)]
melb_postcodes.head()
Out[13]:
postcode suburb state lat lon
4901 3000 MELBOURNE VIC -37.814563 144.970267
4902 3002 EAST MELBOURNE VIC -37.816640 144.987811
4903 3003 WEST MELBOURNE VIC -37.806255 144.941123
4904 3005 WORLD TRADE CENTRE VIC -37.822262 144.954856
4905 3006 SOUTHBANK VIC -37.823258 144.965926

Assign venue clusters

Use the Foursquare API to get venue recommendations for each suburb and then use K-means to assign a venue cluster to the suburb based on the recommendations returned

In [14]:
#intialise Foursquare API credentials
CLIENT_ID = 'EKC0OWGJC1SY1AE1UHB4PUPH2JGARZTQK1U5C1USTUNA43JF' # your Foursquare ID
CLIENT_SECRET = 'CHH0EPUEN2PH4WBDV4XHTDT5NWTUZ2SYVIQALUOZXWYNABRO' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
In [15]:
#Use the function to extract the category from the dataframe (because the column name could be either 'categories' or 'venue.categories')
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
In [16]:
#define a function to return all the venues for a given neighborhood

#pass the neighborhood (nb), the latitude (la), the radius and the limit 
def get_venues(nb, la, lo, radius, limit):
    
    #print(nb)
    
    #form the request url and request only the items (which are the venues)
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, la, lo, VERSION, limit)
    items = requests.get(url).json()["response"]['groups'][0]['items']
    
    #include exception handling where Foursquare request fails for a particular Neighbourhood
    if items == []:
        print(" -- Foursquare request for ",nb,"returned no results --")
        return None
    else:
        # flatten JSON, filter for only wanted columns then use the get_category_type funtion to replace the category list with just the category.
        venues1 = json_normalize(items)
        filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
        venues1 =venues1.loc[:, filtered_columns]
        venues1['venue.categories'] = venues1.apply(get_category_type, axis=1)

        #define a new dataframe with the Neighbourhood information
        venues2 = pd.DataFrame(columns=['Suburb','Suburb Latitude', 
                      'Suburb Longitude']) 

        #for each of the venues returned, add the venue dataframe to the (empty) neighbourhood dataframe and fill all rows in the the neighbourhood columns with the neighborhood name, the neighbourhood latitude and longitude
        i=0
        for row in venues1:
            venues2[row]=venues1[row]
            venues2['Suburb']=nb
            venues2['Suburb Latitude']=la
            venues2['Suburb Longitude']=lo
            i=i+1

        #rename the columns
        venues2.rename(index=str,columns={"venue.name":"Venue","venue.categories":"Venue Category","venue.location.lat":"Venue Latitude","venue.location.lng":"Venue Longitude"}, inplace=True)
    
    return venues2
In [17]:
#test out the get_venues function
row = 300
n = postcodes.loc[row]['suburb']
n_lat = postcodes.loc[row]['lat']
n_long = postcodes.loc[row]['lon']

df2=get_venues(n, n_lat, n_long, 500, 100)
df2.head()
Out[17]:
Suburb Suburb Latitude Suburb Longitude Venue Venue Category Venue Latitude Venue Longitude
0 LANE COVE WEST -33.806246 151.153281 Lane Cove Aquatic Centre Pool -33.813591 151.171442
1 LANE COVE WEST -33.806246 151.153281 Via Napoli Pizzeria Pizza Place -33.815597 151.169879
2 LANE COVE WEST -33.806246 151.153281 Lane Cove National Park Park -33.792008 151.152327
3 LANE COVE WEST -33.806246 151.153281 North Shore Gym Gym -33.814679 151.169105
4 LANE COVE WEST -33.806246 151.153281 Fourno Café -33.795317 151.143533
In [18]:
#Iterate thorough all the suburbs of the dataframe df, adding each set of venues to the df_suburb_venues dataframe
df=melb_postcodes
df_suburb_venues=pd.DataFrame()
radius = 500
limit = 100
print("Getting venue recommendations, please wait......")
for i, row in enumerate(df['suburb']):
        df_suburb_venues=df_suburb_venues.append(get_venues(df.iloc[i]['suburb'], df.iloc[i]['lat'], df.iloc[i]['lon'], radius, limit),ignore_index=True)
print("...... Complete")
Getting venue recommendations, please wait......
...... Complete
In [19]:
counts = df_suburb_venues['Venue Category'].value_counts()
counts.head(10)
Out[19]:
Café                    6132
Fast Food Restaurant    1723
Grocery Store           1455
Supermarket             1426
Sandwich Place          1018
Coffee Shop              917
Shopping Mall            800
Park                     794
Pizza Place              774
Pub                      773
Name: Venue Category, dtype: int64
In [20]:
#Prepare the data for clustering using one-hot encoding
df_onehot = pd.get_dummies(df_suburb_venues['Venue Category'])
#insert a new column for the Neighbourhood values (note that immediately reusing the 'Neighborhood' name causes as error as you can't add a Neighourhood column that already exists, so renamed the column post the insert)
df_onehot.insert(0,"New",df_suburb_venues['Suburb'])
df_onehot.rename(columns={'New':'Suburb'}, inplace=True)

#get the mean frequency occurance
df_onehot = df_onehot.groupby("Suburb").mean().reset_index()
df_onehot.head()
Out[20]:
Suburb Accessories Store Adult Boutique Afghan Restaurant African Restaurant Airport Airport Lounge Airport Service Airport Terminal American Restaurant ... Wine Bar Wine Shop Winery Wings Joint Women's Store Xinjiang Restaurant Yoga Studio Yunnan Restaurant Zoo Zoo Exhibit
0 ABECKETT STREET 0.0 0.0 0.000000 0.0 0.00000 0.0 0.0 0.0 0.0 ... 0.010000 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 ABERFELDIE 0.0 0.0 0.000000 0.0 0.00000 0.0 0.0 0.0 0.0 ... 0.000000 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 AIRPORT WEST 0.0 0.0 0.000000 0.0 0.01087 0.0 0.0 0.0 0.0 ... 0.000000 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 ALBANVALE 0.0 0.0 0.010417 0.0 0.00000 0.0 0.0 0.0 0.0 ... 0.010417 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 ALBERT PARK 0.0 0.0 0.000000 0.0 0.00000 0.0 0.0 0.0 0.0 ... 0.010000 0.01 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 338 columns

In [21]:
#put the top10 into a dataframe
df_topvenues=pd.DataFrame(columns=['Suburb','#1','#2','#3','#4','#5','#6','#7','#8','#9','#10'])

for row in range(0,len(df_onehot)):
    suburb = df_onehot.iloc[row][0]
    topten = df_onehot.iloc[row][1:].sort_values(ascending=False).head(10)
    temp=pd.DataFrame([[suburb,topten.index[0],topten.index[1],topten.index[2],topten.index[3],topten.index[4],topten.index[5],topten.index[6],topten.index[7],topten.index[8],topten.index[9]]],columns=['Suburb','#1','#2','#3','#4','#5','#6','#7','#8','#9','#10'])
    df_topvenues=df_topvenues.append(temp)
df_topvenues.reset_index(inplace=True, drop=True)
df_topvenues.head()
Out[21]:
Suburb #1 #2 #3 #4 #5 #6 #7 #8 #9 #10
0 ABECKETT STREET Korean Restaurant Café Coffee Shop Japanese Restaurant Bar Bubble Tea Shop Indonesian Restaurant Donut Shop Burger Joint Fast Food Restaurant
1 ABERFELDIE Café Electronics Store Grocery Store Gym Bakery Coffee Shop Shopping Mall Pub Japanese Restaurant Pizza Place
2 AIRPORT WEST Fast Food Restaurant Supermarket Café Sandwich Place Grocery Store Portuguese Restaurant Convenience Store Middle Eastern Restaurant Shopping Mall Electronics Store
3 ALBANVALE Fast Food Restaurant Portuguese Restaurant Café Supermarket Grocery Store Bakery Shopping Mall Vietnamese Restaurant Pizza Place Chinese Restaurant
4 ALBERT PARK Café Pub Japanese Restaurant Gastropub Burger Joint Coffee Shop Beach Bakery Fish & Chips Shop Breakfast Spot

Assign a cluster based on the recommendations using the k-means clustering algorithm from Scikit Learn

In [22]:
# import k-means
from sklearn.cluster import KMeans

df_clustering = df_onehot.drop('Suburb',axis=1)

# set number of clusters
k = 5

# run k-means clustering
kmeans = KMeans(n_clusters=k, random_state=0).fit(df_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:100]
Out[22]:
array([0, 3, 4, 4, 3, 4, 3, 1, 3, 3, 2, 3, 4, 3, 3, 3, 3, 1, 1, 1, 2, 4,
       3, 4, 3, 3, 3, 3, 4, 1, 1, 2, 2, 3, 2, 1, 1, 1, 1, 1, 2, 1, 1, 3,
       1, 3, 3, 1, 1, 1, 2, 2, 1, 1, 3, 1, 1, 1, 1, 1, 3, 3, 3, 3, 4, 0,
       0, 0, 3, 0, 0, 1, 2, 3, 3, 4, 4, 2, 4, 4, 3, 3, 3, 3, 4, 0, 0, 1,
       4, 2, 2, 3, 3, 2, 3, 3, 3, 1, 1, 3], dtype=int32)
In [23]:
df_topvenues.insert(0,'Venues Cluster',kmeans.labels_)
In [24]:
df_merged = postcodes

df_merged = df_merged.join(df_topvenues.set_index('Suburb'), on='suburb')

#drop rows with no clusters
df_melbourne_merged = df_merged.dropna(subset=['Venues Cluster'])
df_melbourne_merged.head()
Out[24]:
postcode suburb state lat lon Venues Cluster #1 #2 #3 #4 #5 #6 #7 #8 #9 #10
4901 3000 MELBOURNE VIC -37.814563 144.970267 0.0 Cocktail Bar Italian Restaurant Coffee Shop Café Japanese Restaurant Hotel Asian Restaurant Sandwich Place French Restaurant BBQ Joint
4902 3002 EAST MELBOURNE VIC -37.816640 144.987811 0.0 Café Vietnamese Restaurant Cricket Ground Tennis Stadium Park Breakfast Spot Bakery Vegetarian / Vegan Restaurant Football Stadium Thai Restaurant
4903 3003 WEST MELBOURNE VIC -37.806255 144.941123 0.0 Café Korean Restaurant Bar Hotel Middle Eastern Restaurant Sandwich Place Shopping Mall Bakery Burger Joint Asian Restaurant
4904 3005 WORLD TRADE CENTRE VIC -37.822262 144.954856 0.0 Café Japanese Restaurant Coffee Shop Hotel Bar French Restaurant Steakhouse Restaurant Italian Restaurant Australian Restaurant
4905 3006 SOUTHBANK VIC -37.823258 144.965926 0.0 Bar Theater Hotel Italian Restaurant Performing Arts Venue Park Japanese Restaurant Café Australian Restaurant Burger Joint

Assign cultural clusters

Use the Australian Beaureu of Statistics API to get a breakdown on the ancestry for the suburb suburb and then use K-means to assign a venue cluster suburbs with similar ancestry

In [25]:
#create the url which returns the ABS cencus data for the year 2011 for the question T09 - Ancestry by birthplace of parents
allvic_url='ABS_CENSUS2011_T09/TOT+1+2+3+4+Z.TOT+TOTP+1101+1102+6101+3204+2303+2101+5201+2305+2306+3205+3304+7106+2201+3103+6902+4106+3206+3104+1201+1202+3307+3308+2102+3213+7115+9215+3106+4907+5107+2103+OTH+Z.2.SA2..A/all?detail=Full&dimensionAtObservation=AllDimensions'
In [26]:
#create request object
abs = Request('ABS')
#create response object
response = abs.data(allvic_url)
#create dataset object
dataset=response.data
##create top level series
data=response.write()
data.head()
Out[26]:
MEASURE  ANCP  STATE  REGIONTYPE  REGION     FREQUENCY  TIME_PERIOD
TOT      TOT   2      SA2         201011001  A          2001            7543.0
                                                        2006           10219.0
                                                        2011           10969.0
                                  205051101  A          2001           15263.0
                                                        2006           16144.0
dtype: float64

The data above is arranged in a single multi index column (with region being the postcode code and ANCP being the ancestry country code), so we need to pull the indexes into columns and then we can access the columns we need more easily:

In [27]:
#using reset_index removes all the levels of a multi-index column by default
df_ancestry = data.reset_index(name='Value')
print(df_ancestry.shape)
df_ancestry.head()
(265608, 8)
Out[27]:
MEASURE ANCP STATE REGIONTYPE REGION FREQUENCY TIME_PERIOD Value
0 TOT TOT 2 SA2 201011001 A 2001 7543.0
1 TOT TOT 2 SA2 201011001 A 2006 10219.0
2 TOT TOT 2 SA2 201011001 A 2011 10969.0
3 TOT TOT 2 SA2 205051101 A 2001 15263.0
4 TOT TOT 2 SA2 205051101 A 2006 16144.0
In [28]:
#Take only 2011 data
df_ancestry = df_ancestry[(df_ancestry['TIME_PERIOD']=='2011')]

#The data includes six different measures of ancestry (father only, mother only, both).
#in this case the only measure i'm interested in the totals for each ancestry
df_ancestry = df_ancestry[(df_ancestry['MEASURE']=='TOT')&(df_ancestry['ANCP']!='TOT')]

#pivot the Ancestry to columns in preparation for k-means analysis
df_ancestry = df_ancestry.pivot(index='REGION',columns='ANCP', values='Value')

df_ancestry.head()
Out[28]:
ANCP 1101 1102 1201 1202 2101 2102 2103 2201 2303 2305 ... 5107 5201 6101 6902 7106 7115 9215 OTH TOTP Z
REGION
201011001 3340.0 6.0 13.0 46.0 3387.0 940.0 54.0 1167.0 253.0 17.0 ... 0.0 30.0 109.0 3.0 97.0 4.0 33.0 319.0 8195.0 373.0
201011002 4643.0 11.0 20.0 38.0 4979.0 1609.0 86.0 2201.0 293.0 43.0 ... 4.0 36.0 201.0 9.0 159.0 5.0 35.0 583.0 12398.0 848.0
201011003 8635.0 32.0 22.0 70.0 8847.0 2535.0 137.0 3537.0 571.0 83.0 ... 12.0 53.0 216.0 7.0 102.0 9.0 39.0 757.0 21516.0 1276.0
201011004 9146.0 48.0 36.0 107.0 9169.0 2531.0 147.0 2969.0 683.0 98.0 ... 22.0 74.0 308.0 0.0 156.0 0.0 35.0 912.0 22579.0 1713.0
201011005 2554.0 6.0 8.0 39.0 2939.0 861.0 36.0 1103.0 236.0 41.0 ... 0.0 19.0 52.0 4.0 40.0 9.0 27.0 240.0 6583.0 279.0

5 rows × 33 columns

In [29]:
# set number of clusters
k = 10

# run k-means clustering
kmeans = KMeans(n_clusters=k, random_state=0).fit(df_ancestry)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:100]
Out[29]:
array([5, 2, 0, 0, 1, 1, 8, 2, 1, 1, 5, 1, 8, 8, 8, 1, 1, 2, 5, 2, 5, 4,
       8, 1, 5, 1, 5, 5, 8, 5, 1, 8, 1, 8, 1, 1, 2, 0, 2, 4, 4, 4, 2, 5,
       2, 5, 2, 5, 1, 4, 1, 8, 2, 1, 1, 2, 5, 8, 1, 8, 8, 8, 5, 8, 8, 4,
       5, 8, 2, 8, 8, 1, 2, 0, 8, 2, 1, 1, 4, 8, 2, 1, 8, 5, 1, 1, 5, 8,
       5, 5, 5, 8, 0, 5, 4, 2, 0, 8, 8, 8], dtype=int32)
In [30]:
df_ancestry.insert(0,'Ancestry Cluster',kmeans.labels_)
df_ancestry.head()
Out[30]:
ANCP Ancestry Cluster 1101 1102 1201 1202 2101 2102 2103 2201 2303 ... 5107 5201 6101 6902 7106 7115 9215 OTH TOTP Z
REGION
201011001 5 3340.0 6.0 13.0 46.0 3387.0 940.0 54.0 1167.0 253.0 ... 0.0 30.0 109.0 3.0 97.0 4.0 33.0 319.0 8195.0 373.0
201011002 2 4643.0 11.0 20.0 38.0 4979.0 1609.0 86.0 2201.0 293.0 ... 4.0 36.0 201.0 9.0 159.0 5.0 35.0 583.0 12398.0 848.0
201011003 0 8635.0 32.0 22.0 70.0 8847.0 2535.0 137.0 3537.0 571.0 ... 12.0 53.0 216.0 7.0 102.0 9.0 39.0 757.0 21516.0 1276.0
201011004 0 9146.0 48.0 36.0 107.0 9169.0 2531.0 147.0 2969.0 683.0 ... 22.0 74.0 308.0 0.0 156.0 0.0 35.0 912.0 22579.0 1713.0
201011005 1 2554.0 6.0 8.0 39.0 2939.0 861.0 36.0 1103.0 236.0 ... 0.0 19.0 52.0 4.0 40.0 9.0 27.0 240.0 6583.0 279.0

5 rows × 34 columns

The table above refers to REGION and ANCP (ancestry) by codes. We need to get the xml from the ABS which describes what each code value is and and use it to merge the name of the suburb back into the data.

In [31]:
#an xml parser is required to get the codes from the Australian Bureau of Statistics xml file
import xml.etree.ElementTree as ET
In [32]:
#the xml file of codes from the ABS
user_agent_url = 'http://stat.data.abs.gov.au/restsdmx/sdmx.ashx/GetDataStructure/ABS_CENSUS2011_T09'
xml_data = requests.get(user_agent_url).content

#create the xml tree from the file
tree = ET.fromstring(xml_data)
In [33]:
#Create a dataframe which will include all the codes
df_codes = pd.DataFrame(columns=['code','value'])

#parse the structure xml to get the codes:
for child in tree:
    for lower in child:
        for codelist in lower:
            for description in codelist:
                if(description.attrib.get('{http://www.w3.org/XML/1998/namespace}lang')=='en'):
                    code = codelist.attrib.get('value')
                    value = description.text
                    temp_df = pd.DataFrame([[code,value]],columns=['code','value'])
                    df_codes=pd.concat([df_codes,temp_df])

#set the code as the index of the table
df_codes.set_index('code', inplace=True)

#strip whitespaces from the values column
df_codes['value'] = df_codes['value'].str.strip()
df_codes.head(12)
Out[33]:
value
code
TOT Total responses(c)
1 Both parents born overseas
2 Father only born overseas
3 Mother only born overseas
4 Both parents born in Australia
Z Birthplace not stated(b)
TOT Total responses(c)
TOTP Total persons(c)
1101 Australian
1102 Australian Aboriginal
6101 Chinese
3204 Croatian
In [34]:
#merge the codes back into the Ancestrt dataframe
df_ancestry = df_ancestry.join(df_codes)

We now have a dataframe of Suburbs with Ancestry clusters. We'll clean this dataframe up as well as the original dataframe containing the Venue Cluster so that we can merge them in the subsequent step on suburb name.

In [35]:
# select only the required columns, fix the names and make suburb lowercase for searching
df_ancestry = df_ancestry[['value','Ancestry Cluster']]
df_ancestry.rename(columns={"value":"suburb"}, inplace=True)

#strip out whitespace, change to lowercase
df_ancestry['suburb'] = df_ancestry['suburb'].str.lower()
df_melbourne_merged['suburb']=df_melbourne_merged['suburb'].str.lower()
df_ancestry.sort_values(by=['suburb'], inplace=True)
df_melbourne_merged.sort_values(by=['suburb'], inplace=True)
/opt/conda/envs/Python36/lib/python3.6/site-packages/ipykernel/__main__.py:7: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/opt/conda/envs/Python36/lib/python3.6/site-packages/ipykernel/__main__.py:9: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
In [36]:
#merge on suburb
df_complete = df_melbourne_merged.merge(df_ancestry, how='left', on='suburb')

#and pick only relevant columns
df_complete = df_complete[['postcode','suburb','state','Venues Cluster','Ancestry Cluster']]

#and see how many suburbs didn't find a matching Ancestry Cluster
df_complete[pd.isnull(df_complete['Ancestry Cluster'])]['suburb'].shape
Out[36]:
(296,)

The above shows that 296 of our suburbs did not have a entry in the Ancestry dataframe when we merge directly on Suburb name. We'll need to do some looser matching which gets this number down to 93. Note that the remaining 93 are not really suburbs - they are standalone postcodes used by Australia Post

In [37]:
#where suburb couldn't be matched directly, try to find the suburb name as a substing in df_ancestry:
for suburb in df_complete[pd.isnull(df_complete['Ancestry Cluster'])]['suburb']:
    if(len(df_ancestry.loc[df_ancestry['suburb'].str.contains(suburb)]['Ancestry Cluster'])>0):
        df_complete.loc[df_complete[df_complete['suburb']==suburb].index,'Ancestry Cluster']=df_ancestry.loc[df_ancestry['suburb'].str.contains(suburb)]['Ancestry Cluster'][0]

# where there is still no macth, try to split the suburb and find the sub string. This will allow, for example, 'Altona North' to get the cluster for 'Altona'
for suburb in df_complete[pd.isnull(df_complete['Ancestry Cluster'])]['suburb']:
    suburb2 = suburb.split()[0].strip()
    if(len(df_ancestry.loc[df_ancestry['suburb'].str.contains(suburb2)]['Ancestry Cluster'])>0):
        df_complete.loc[df_complete[df_complete['suburb']==suburb].index,'Ancestry Cluster']=df_ancestry.loc[df_ancestry['suburb'].str.contains(suburb2)]['Ancestry Cluster'][0]


#and see how many suburbs didn't find a matching Ancestry Cluster
df_complete[pd.isnull(df_complete['Ancestry Cluster'])]['suburb'].shape
Out[37]:
(93,)
In [38]:
#all the suburbs without an Ancestry Cluster are not actually recognisable suburbs, so we can drop these:
df_complete.dropna(inplace=True)
In [39]:
df_complete.head()
Out[39]:
postcode suburb state Venues Cluster Ancestry Cluster
1 3040 aberfeldie VIC 3.0 0.0
2 3042 airport west VIC 4.0 1.0
4 3206 albert park VIC 3.0 2.0
5 3020 albion VIC 4.0 1.0
6 3078 alphington VIC 3.0 1.0

Assign age and income brackets

Most real estate search tools offer suburb profiles which include information prospective buyers/renters would typically be interested in, such as recent sold prices as well as demographic information. This project will utilize limited demographic information scraped from RealestateVIEW.com.au; Median Age and Median Weekly Income.

In [40]:
#define a function which takes a suburb name and scrapes the web for median age and household income statistics
def get_age_income(state,suburb):
    
    #Format the url
    url='https://propertydata.realestateview.com.au/propertydata/suburb-profile/'+state+'/'
    suburb = suburb.split()
    i=0
    for i in suburb:
        url=url+i+"+"
    
    age = 0
    income = 0

    #get the page text for the suburb
    page = requests.get(url).text
    soup = BeautifulSoup(page, 'lxml')
    
    #find all the important elements
    sp = soup.find_all('b')
    
    #get the median age and income for the suburb
    for element in sp:
        if element.text=='Median Age':
            age = element.next_sibling.next_sibling.translate({ord(i): None for i in '\n\t'})
        if element.text=='Weekly Household Income':
            income = element.next_sibling.next_sibling.translate({ord(i): None for i in '\n\t'})
        
    return age, income
In [41]:
#insert columns for median age and weekly income
df_complete.insert(4,'Age',0)
df_complete.insert(4,'Income',0)
In [42]:
#fetch Median Age and Weekly Income from https://propertydata.realestateview.com.au/
print('Getting Median Age and Weekly Income from https://propertydata.realestateview.com.au/, please wait...')
for row in df_complete.itertuples():
    result = get_age_income('victoria',row.suburb)
    df_complete.loc[df_complete['suburb']==row.suburb,'Age']=result[0]
    df_complete.loc[df_complete['suburb']==row.suburb,'Income']=result[1]
print('...complete')
Getting Median Age and Weekly Income from https://propertydata.realestateview.com.au/, please wait...
...complete
In [43]:
df_complete.head()
Out[43]:
postcode suburb state Venues Cluster Income Age Ancestry Cluster
1 3040 aberfeldie VIC 3.0 2200.00 40 0.0
2 3042 airport west VIC 4.0 1375.00 40 1.0
4 3206 albert park VIC 3.0 2263.00 43 2.0
5 3020 albion VIC 4.0 1010.00 33 1.0
6 3078 alphington VIC 3.0 2123.00 38 1.0

We now have a complete list of suburbs with clusters assigned based on the venues in the area and the ancestry of the people, as well as the relative age and income of the inhabitants

Recommend a Suburb

In [44]:
#postcode and state are not required for the recommendation calculation
similar_sort = df_complete.drop(['postcode','state'], axis=1)

#Update the index and clean up the column types
similar_sort.set_index('suburb', inplace=True)
similar_sort['Age']=similar_sort.Age.astype(float)
similar_sort['Income']=similar_sort.Income.astype(float)

#normalize the data into a new dataframe
from sklearn.preprocessing import MinMaxScaler
x = similar_sort.values #returns a numpy array
min_max_scaler = MinMaxScaler()
feature_mtx = min_max_scaler.fit_transform(x)
feature_mtx [0:5]

df_normal = pd.DataFrame(feature_mtx, index=similar_sort.index, columns=similar_sort.columns)
In [45]:
# The following function takes in a suburb and returns the top 10 similar suburbs
def subrec(suburb):
    
    try:
        #choose a suburb for comparison
        new_suburb = suburb.lower()
        new_suburb_df = list(df_normal.loc[new_suburb])

        #get the absolute difference between the new suburb and all other suburbs
        temp_df = df_normal-new_suburb_df
        temp_df = temp_df.abs()
        recommendations = temp_df.sum(axis=1).sort_values().head(10)

        return recommendations

    except:
        print("Error: Suburb not found")
    

Using the function for a given suburb gives the similarity scores for the top 10 similar suburbs

In [46]:
suburb = 'Prahran'
results = subrec(suburb)
results
Out[46]:
suburb
prahran           0.000000
fitzroy north     0.194010
south yarra       0.254777
hawthorn east     0.286826
yallambie         0.296024
clifton hill      0.320766
coburg            0.337025
malvern east      0.339021
princes hill      0.393546
brunswick east    0.397217
dtype: float64

We can see more details for the top 10 including their feature values:

In [47]:
detailed_results = df_complete[df_complete['suburb'].isin(results.index)]
detailed_results
Out[47]:
postcode suburb state Venues Cluster Income Age Ancestry Cluster
66 3057 brunswick east VIC 0.0 1726.00 32 1.0
108 3068 clifton hill VIC 0.0 2272.00 36 5.0
110 3058 coburg VIC 0.0 1638.00 35 6.0
159 3068 fitzroy north VIC 0.0 1920.00 36 5.0
185 3123 hawthorn east VIC 0.0 1976.00 34 2.0
247 3145 malvern east VIC 1.0 1937.00 36 4.0
311 3181 prahran VIC 0.0 1850.00 33 4.0
317 3054 princes hill VIC 0.0 1857.00 36 1.0
350 3141 south yarra VIC 0.0 1813.00 32 2.0
414 3085 yallambie VIC 1.0 1871.00 35 4.0

Map the results

In [2]:
!pip install folium
Collecting folium
  Downloading https://files.pythonhosted.org/packages/4f/86/1ab30184cb60bc2b95deffe2bd86b8ddbab65a4fac9f7313c278c6e8d049/folium-0.9.1-py2.py3-none-any.whl (91kB)
     |████████████████████████████████| 92kB 14.3MB/s eta 0:00:01
Requirement already satisfied: requests in /opt/conda/envs/Python36/lib/python3.6/site-packages (from folium) (2.21.0)
Requirement already satisfied: jinja2>=2.9 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from folium) (2.10)
Requirement already satisfied: numpy in /opt/conda/envs/Python36/lib/python3.6/site-packages (from folium) (1.15.4)
Collecting branca>=0.3.0 (from folium)
  Downloading https://files.pythonhosted.org/packages/63/36/1c93318e9653f4e414a2e0c3b98fc898b4970e939afeedeee6075dd3b703/branca-0.3.1-py3-none-any.whl
Requirement already satisfied: idna<2.9,>=2.5 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from requests->folium) (2.8)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from requests->folium) (2019.6.16)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from requests->folium) (3.0.4)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from requests->folium) (1.24.1)
Requirement already satisfied: MarkupSafe>=0.23 in /opt/conda/envs/Python36/lib/python3.6/site-packages (from jinja2>=2.9->folium) (1.1.0)
Requirement already satisfied: six in /opt/conda/envs/Python36/lib/python3.6/site-packages (from branca>=0.3.0->folium) (1.12.0)
Installing collected packages: branca, folium
Successfully installed branca-0.3.1 folium-0.9.1
In [3]:
import folium # map rendering library
In [5]:
latitude=-37.814563
longitude=144.970267
print(latitude,longitude)
-37.814563 144.970267
In [54]:
detailed_results = df_melbourne_merged[df_melbourne_merged['suburb'].isin(results.index)]
In [55]:
detailed_results
Out[55]:
postcode suburb state lat lon Venues Cluster #1 #2 #3 #4 #5 #6 #7 #8 #9 #10
5016 3057 brunswick east VIC -37.764910 144.979567 0.0 Café Bar Bakery Pizza Place Mexican Restaurant Beer Garden Vietnamese Restaurant Italian Restaurant Grocery Store Gastropub
5036 3068 clifton hill VIC -37.788118 144.992067 0.0 Café Bakery Pub Vegetarian / Vegan Restaurant Italian Restaurant Bar Coffee Shop Gastropub Cocktail Bar Ice Cream Shop
5018 3058 coburg VIC -37.743188 144.966279 0.0 Café Bar Bakery Middle Eastern Restaurant Grocery Store Vietnamese Restaurant Beer Garden Thai Restaurant Pizza Place Mexican Restaurant
5037 3068 fitzroy north VIC -37.783400 144.984688 0.0 Café Bar Pub Bakery Italian Restaurant Vegetarian / Vegan Restaurant Coffee Shop Ice Cream Shop Gastropub Wine Shop
5120 3123 hawthorn east VIC -37.782254 145.001811 0.0 Café Bakery Vegetarian / Vegan Restaurant Pub Pizza Place Indian Restaurant Gastropub Italian Restaurant Bar Coffee Shop
5184 3145 malvern east VIC -37.878370 145.067892 1.0 Café Korean Restaurant Park Supermarket Electronics Store Breakfast Spot Chinese Restaurant Thai Restaurant Malay Restaurant Fast Food Restaurant
5253 3181 prahran VIC -37.849577 144.993714 0.0 Café Italian Restaurant Japanese Restaurant Bar Coffee Shop Pub Deli / Bodega Pizza Place Clothing Store French Restaurant
5008 3054 princes hill VIC -37.780995 144.962792 0.0 Café Bar Zoo Exhibit Italian Restaurant Bakery Pub Vietnamese Restaurant Dog Run Gastropub Grocery Store
5173 3141 south yarra VIC -37.837883 144.991123 0.0 Café Italian Restaurant French Restaurant Japanese Restaurant Bakery Hotel Dessert Shop Coffee Shop Pub Korean Restaurant
5071 3085 yallambie VIC -37.727482 145.102309 1.0 Café Fast Food Restaurant Grocery Store Supermarket Sandwich Place Convenience Store Pub Art Gallery Shopping Mall Burger Joint
In [70]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

colors_list = ['Red','Blue','Green','Yellow','Purple','Pink','Orange']

#add markers to the map
markers_colors = []
for lat, lon, suburb in zip(detailed_results['lat'], detailed_results['lon'], detailed_results['suburb']):
    label = folium.Popup(str(suburb), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        #color=colors_list[int(cluster)],
        fill=True,
       fill_opacity=0.7).add_to(map_clusters)

       
map_clusters
Out[70]: